2019-04-27

Background

High-Performance Hardware costs declining

  • August 2017: Advanced Micro Devices (AMD) x86 “Zen” core architecture released

  • Ryzen (retail), Threadripper (high-end desktop), and EPYC (server) product lines.

  • Performance similar to Intel's, but at a deep discount.

  • Bottom Line for Data Peeps:
    • Twice the core count for less money
    • Or more CPU for the same money

https://www.amd.com/en/technologies/zen-core

Early adoption looks good

  • Market response is positive: Zen product lines are gaining acceptance among graphics designers, gamers, and data professionals

Taking a chance on AMD

  • Bought a Threadripper 1950X 3.4 GHz 16-Core Processor

It’s ALIVE

Testing and Benchmarking

Benchmarking

Using the benchmarkme package by Colin Gillespie.

library(benchmarkme)
  • Contains functions which run matrix algebra computations on random data.

  • Matrix algebra computations are the core of statistical/Machine Learning models.

  • Contains crowd-sourced benchmarks from other useRs for comparison.

Benchmarks based heavily on the R script by Simon Urbanek & Doug Bates:

http://r.research.att.com/benchmarks/R-benchmark-25.R
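A minimal session along these lines might look like the following sketch. The `runs` and `cores` arguments reflect my reading of the benchmarkme interface; check the package documentation before relying on them, and uploading is entirely optional.

```r
# Install once from CRAN, then load
# install.packages("benchmarkme")
library(benchmarkme)

# Single-core run of the matrix-function benchmarks (3 repetitions)
res <- benchmark_matrix_fun(runs = 3)

# Parallel run on 8 cores for comparison
res_par <- benchmark_matrix_fun(runs = 3, cores = 8)

# Rank this machine against the crowd-sourced results
plot(res)

# Optionally share your timings with other useRs
# upload_results(res)
```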

Benchmark: General Programming

  • 3,500,000 Fibonacci numbers calculation (vector calc).

  • Creation of a 3500x3500 Hilbert matrix (matrix calc).

  • Greatest common divisors of 1,000,000 pairs (recursion).

  • Creation of a 1600x1600 Toeplitz matrix (loops).

  • Escoufier’s method on a 60x60 matrix (mixed).

Benchmark: Matrix Calculation

  • Creation, transposition, and deformation of a 2500x2500 matrix.

  • 2500x2500 normally distributed random matrix raised (element-wise) to the 1000th power.

  • Sorting of 7,000,000 random values.

  • 2500x2500 cross-product matrix (b = a' * a).

  • Linear regression over a 3000x3000 matrix.
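For a sense of what these tests exercise, the cross-product benchmark boils down to a couple of lines of base R (sizes as above; timings will of course vary with the machine and the linked BLAS):

```r
# 2500x2500 cross-product: b = t(a) %*% a, which dispatches to
# BLAS matrix-multiply routines under the hood
a <- matrix(rnorm(2500 * 2500), nrow = 2500)
system.time(b <- crossprod(a))
```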

Benchmark: Matrix Functions

  • FFT over 2,500,000 random values.

  • Eigenvalues of a 640x640 random matrix.

  • Determinant of a 2500x2500 random matrix.

  • Cholesky decomposition of a 3000x3000 matrix.

  • Inverse of a 1600x1600 random matrix.
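These are the LAPACK-backed operations. The Cholesky test, for example, amounts to something like the sketch below (using a positive-definite matrix built from a random one):

```r
# Build a 3000x3000 symmetric positive-definite matrix, then factor it;
# chol() calls into LAPACK, so its speed depends directly on the
# BLAS/LAPACK libraries R is linked against
m <- crossprod(matrix(rnorm(3000 * 3000), nrow = 3000)) + diag(3000)
system.time(ch <- chol(m))
```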

base-R 1 core: General Programming

## You are ranked 12 out of 163 machines.

base-R 1 core: Matrix Calculation

## You are ranked 4 out of 162 machines.

base-R 1 core: Matrix Functions

## You are ranked 10 out of 162 machines.

base-R 8 cores: General Programming

## You are ranked 2 out of 5 machines.

base-R 8 cores: Matrix Calculation

## You are ranked 1 out of 5 machines.

base-R 8 cores: Matrix Functions

## You are ranked 4 out of 5 machines.

base-R results summary

  • General Programming benchmarks are excellent!

  • Matrix Calculation benchmarks are excellent!

  • Matrix Function benchmarks lag severely when run in parallel!?

The goal of this entire build was to run models in parallel, so this is no good!

Linear Algebra Functions in R

base-R interfaces with BLAS (Basic Linear Algebra Subprograms), routines that provide standard building blocks for performing linear algebra operations:

  • scalar multiplication

  • dot products

  • linear combinations

  • matrix operations

  • Specified with both C (CBLAS) and Fortran interfaces

Papers and History here: http://www.netlib.org/blas/
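Before swapping anything, it is worth checking which BLAS/LAPACK builds your R is actually linked against. In R >= 3.4, sessionInfo() reports the shared-library paths (the paths shown below are illustrative, not from my machine):

```r
# Report the BLAS and LAPACK shared libraries R was loaded with
sessionInfo()
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
```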

Let's Try another BLAS!
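On Debian/Ubuntu, swapping the BLAS that R uses does not require rebuilding R: the package manager's alternatives system switches the shared library system-wide. The package and alternative names below are from Ubuntu 18.04-era systems and may differ on yours.

```shell
# Install the OpenBLAS shared library
sudo apt-get install libopenblas-base

# Select it as the system-wide BLAS, then restart R
sudo update-alternatives --config libblas.so.3-x86_64-linux-gnu
```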

OpenBLAS 8 cores: Matrix Functions

## You are ranked 2 out of 5 machines.

OpenBLAS 16 cores: Matrix Functions

## You are ranked 2 out of 7 machines.

From BLAS/LAPACK to BLIS/libFLAME

Poking around the internet and researching various BLAS libraries led me to BLIS and the FLAME project!

BLIS/libFLAME are high-performance dense linear algebra libraries, each addressing a layer in the linear algebra software stack.

Primarily developed and maintained by individuals in the Science of High-Performance Computing (SHPC) group in the Institute for Computational Engineering and Sciences at The University of Texas at Austin.

https://www.cs.utexas.edu/~flame/web/

BLIS/libFLAME

BLIS is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, enable optimized implementations of most of its commonly used and computationally intensive operations. Build BLIS from source on Github here: https://github.com/flame/blis

libFLAME is a high performance dense linear algebra library that is the result of the FLAME methodology for systematically developing dense linear algebra libraries. The FLAME methodology is radically different from the LINPACK/LAPACK approach that dates back to the 1970s, but is backwards compatible with them. Build libFLAME from source on Github here: https://github.com/flame/libflame/
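Here is a sketch of the build-and-link steps I would expect on Ubuntu, based on the BLIS README. The configure flags, install prefix, and alternatives name are assumptions to verify against your own system.

```shell
# Build BLIS with OpenMP threading; "auto" detects the
# microarchitecture (use "zen" to force it on Threadripper)
git clone https://github.com/flame/blis.git
cd blis
./configure --enable-cblas --enable-threading=openmp auto
make -j16
sudo make install    # installs under /usr/local by default

# Register the BLIS shared library as a BLAS alternative for R to load
sudo update-alternatives --install /usr/lib/x86_64-linux-gnu/libblas.so.3 \
  libblas.so.3-x86_64-linux-gnu /usr/local/lib/libblis.so 100

# Control the thread count at run time
export BLIS_NUM_THREADS=16
```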

The best part is: They really ROCK!

BLIS/libFLAME 8 cores: Matrix Functions

## You are ranked 1 out of 5 machines.

BLIS/libFLAME 16 cores: Matrix Functions

## You are ranked 1 out of 7 machines.

BLIS/libFLAME 16 cores: Matrix Functions

BLIS/libFLAME places 1st of 7 with a time of 10.66 seconds, beats the next submission by a factor of ~5, and the last by a factor of ~20.

No Intel submissions with this many cores to benchmark against.

   time  cpu                                             ram     sysname  release                cores
 10.663  AMD Ryzen Threadripper 1950X 16-Core Processor  NA      Linux    4.19.0-041900-generic  16
 50.583  AMD Ryzen Threadripper 1950X 16-Core Processor  NA      Windows  >= 8 x64               16
159.536  AMD Ryzen Threadripper 1950X 16-Core Processor  33.664  Linux    4.19.9-041909-generic  16
160.229  AMD Ryzen Threadripper 1950X 16-Core Processor  33.664  Linux    4.19.9-041909-generic  16
164.089  AMD Ryzen Threadripper 1950X 16-Core Processor  33.664  Linux    4.19.9-041909-generic  16
181.100  AMD Ryzen Threadripper 1950X 16-Core Processor  33.664  Linux    4.19.9-041909-generic  16
205.805  AMD Ryzen Threadripper 1950X 16-Core Processor  33.664  Linux    4.19.9-041909-generic  16

Fast Linear Algebra functions achieved!

On a single core, base-R's reference BLAS is slightly faster, by 0.49 seconds (a factor of 1.37).

  • BLAS single core: 1.34 seconds.

  • BLIS single core: 1.83 seconds.

But on 8 cores, BLIS becomes much faster, by 40.41 seconds (a factor of 7.83).

  • BLAS 8-core 46.32 seconds.

  • BLIS 8-core 5.91 seconds.

On 16 cores, BLIS is faster still, by 95.03 seconds (a factor of 9.91).

  • BLAS 16-core 105.69 seconds.

  • BLIS 16-core 10.66 seconds.
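The differences and ratios above can be sanity-checked from the raw timings (Python here purely as a calculator):

```python
# Matrix Functions benchmark timings (seconds) from the runs above
timings = {
    "single": {"blas": 1.34, "blis": 1.83},
    "8-core": {"blas": 46.32, "blis": 5.91},
    "16-core": {"blas": 105.69, "blis": 10.66},
}

for cores, t in timings.items():
    diff = t["blas"] - t["blis"]  # positive means BLIS wins
    factor = max(t.values()) / min(t.values())
    print(f"{cores}: difference {diff:+.2f} s, factor {factor:.1f}x")
```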

Resources for switching from BLAS to BLIS!

Try BLIS! Thank You